Optimise cache and precision for DL training #186
prockenschaub wants to merge 7 commits into rvandewater:development from
Conversation
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
icu_benchmarks/data/loader.py (2)
115-134: ⚠️ Potential issue | 🔴 Critical: in-place mutation of cached `_label_arrays[stay_id]` corrupts `pad_mask` on subsequent calls.

On Line 116, `labels` is a live reference to the array stored in `self._label_arrays[stay_id]`. When `len(labels) > 1` and `length_diff <= 0` (i.e., a stay whose length equals `maxlen`), neither the `len(labels) == 1` branch nor the `length_diff > 0` branch allocates a new array, so `labels[not_labeled] = -1` on Line 133 mutates the cached entry in place.

After the first call, the cached labels have `-1` where NaNs used to be. On any subsequent rebuild (e.g., when `ram_cache=False`, or if `ram_cache(True)` is ever called again), `np.argwhere(np.isnan(labels))` returns an empty array, so the `pad_mask[not_labeled] = 0` line on Line 134 is skipped and `pad_mask` is returned as all ones at positions that should have been masked out, silently feeding unlabeled timesteps into the loss.

`window` is never mutated, so only `labels` needs a defensive copy.

🐛 Proposed fix

```diff
 window = self._feat_arrays[stay_id]
-labels = self._label_arrays[stay_id]
+# Copy to avoid mutating the cached entry below (labels[not_labeled] = -1).
+labels = self._label_arrays[stay_id].copy()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@icu_benchmarks/data/loader.py` around lines 115 - 134, The issue is in-place mutation of the cached labels array (self._label_arrays[stay_id]) causing persistent -1s that break pad_mask later; fix by making a defensive copy of the labels before any mutation (e.g., assign labels = self._label_arrays[stay_id].copy() or np.array(self._label_arrays[stay_id], copy=True) immediately after retrieving it), and ensure any branches that build new label arrays (the len(labels) == 1 concatenation and the length_diff > 0 padding branch) operate on and return this copied/allocated array so subsequent writes like labels[not_labeled] = -1 do not mutate the cached self._label_arrays entry (refer to variables stay_id, labels, self._label_arrays, not_labeled, pad_mask).
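A minimal standalone sketch of the fixed access pattern, assuming the branch structure described above (the helper name `get_padded_labels` and the `label_arrays`/`maxlen` parameters are illustrative, not the repo's actual signature):

```python
import numpy as np

def get_padded_labels(label_arrays: dict, stay_id: int, maxlen: int):
    """Illustrative helper: return (labels, pad_mask) without mutating the cache."""
    # Defensive copy: every later write (NaN padding, -1 fill) hits a
    # private buffer instead of the cached label_arrays entry.
    labels = label_arrays[stay_id].copy()

    length_diff = maxlen - len(labels)
    if length_diff > 0:
        # Pad short stays with NaN; these positions are masked out below.
        labels = np.concatenate([labels, np.full(length_diff, np.nan)])

    # NaN labels mark unlabeled (or padded) timesteps: fill them with -1
    # and zero them in the padding mask so they never reach the loss.
    not_labeled = np.argwhere(np.isnan(labels)).flatten()
    labels[not_labeled] = -1
    pad_mask = np.ones(maxlen, dtype=np.float32)
    pad_mask[not_labeled] = 0
    return labels, pad_mask
```

Because the copy happens before any branching, the `length_diff <= 0` case that previously aliased the cache is covered as well.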
174-195: ⚠️ Potential issue | 🔴 Critical: float32 cast of `row_indicators` corrupts stay-id precision in exported predictions.

`row_indicators` carries the `GROUP` (stay_id) and, for temporal data, `SEQUENCE` (time in hours). These are actively passed to `_save_model_outputs()` and exported to `pred_indicators.csv` for downstream analysis. float32 can only represent integers exactly up to 2²⁴ ≈ 16.7 million; hospital-derived stay ids (e.g., MIMIC-IV) routinely exceed this. Unconditionally casting to float32 on line 194 silently collapses distinct ids, corrupting the CSV export and breaking any attempt to match predictions back to original stays.

This is a regression: commit 52117f4 introduced conditional mps-based casting (`if self.mps: ... else:` native dtype), but the `self.mps` flag (line 53) is set but never used in `to_tensor()`. The condition was removed and replaced with unconditional float32 casting.

Only `data` and `labels` require float32 for tensor operations; `row_indicators` is metadata and should retain its native dtype.

Fix for `to_tensor`

```diff
 def to_tensor(self) -> tuple[Tensor, Tensor, Tensor]:
     data, labels, row_indicators = self.get_data_and_labels()
-    # Always use float32 for memory efficiency and MPS compatibility
     return (
         from_numpy(data),
         from_numpy(labels),
-        from_numpy(row_indicators.astype(np.float32)),
+        from_numpy(row_indicators),
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@icu_benchmarks/data/loader.py` around lines 174 - 195, The to_tensor method currently forces row_indicators to float32 which corrupts stay_id precision; update to_tensor (and/or get_data_and_labels usage) so only data and labels are cast to float32 before creating tensors and row_indicators is passed through with its native integer dtype (do not call .astype(np.float32) on row_indicators). Use the existing self.mps flag if you need MPS-specific casting for data/labels, but ensure row_indicators stays as the original numpy dtype when calling from_numpy(row_indicators) so exported pred_indicators.csv keeps exact stay_id/sequence values.
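The 2²⁴ cutoff is easy to demonstrate; a quick check, assuming only numpy and torch (the ids are made up, chosen to straddle 2²⁴ = 16,777,216):

```python
import numpy as np
from torch import from_numpy

# Two distinct stay ids just past the float32 integer-precision limit.
big_ids = np.array([16_777_216, 16_777_217], dtype=np.int64)

# Cast to float32: both round to the same value, so the ids collide.
print(np.unique(big_ids.astype(np.float32)).size)  # -> 1

# Kept at the native int64 dtype, the tensor round-trips exactly.
print(from_numpy(big_ids).tolist())  # -> [16777216, 16777217]
```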
🧹 Nitpick comments (1)
icu_benchmarks/models/train.py (1)
114-131: Consider gating `pin_memory` on accelerator availability.

When `cpu=True` the trainer runs on CPU and pinning is wasted work (and can emit warnings on some backends). A trivial gate avoids this:

```diff
-    persistent_workers=persistent_workers,
-    pin_memory=True,
+    persistent_workers=persistent_workers,
+    pin_memory=not cpu,
```

…applied to both the train and val loaders (the test loader at Line 199 already runs inside the same `cpu` scope and could be tightened similarly).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@icu_benchmarks/models/train.py` around lines 114 - 131, The DataLoader instances (train_loader and val_loader) set pin_memory=True unconditionally which wastes work or triggers warnings when running on CPU; change both DataLoader constructions to set pin_memory = (not cpu) or equivalent so pinning is enabled only when an accelerator/GPU is used (reference DataLoader, train_loader, val_loader and the cpu/trainer flag), and apply the same gating pattern used for the test loader scope.
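A sketch of the gated construction, assuming a boolean `cpu` flag like the one referenced above (the function and argument names are illustrative, not the repo's actual signature):

```python
from torch.utils.data import DataLoader, Dataset

def build_loaders(train_ds: Dataset, val_ds: Dataset, batch_size: int,
                  num_workers: int, cpu: bool):
    common = dict(
        batch_size=batch_size,
        num_workers=num_workers,
        persistent_workers=num_workers > 0,
        # Pinning host memory only pays off when batches are copied to an
        # accelerator; on a CPU-only run it is wasted work.
        pin_memory=not cpu,
    )
    train_loader = DataLoader(train_ds, shuffle=True, **common)
    val_loader = DataLoader(val_ds, shuffle=False, **common)
    return train_loader, val_loader
```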
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5bda1778-f36e-40e7-ac75-7b6de4aded30
📒 Files selected for processing (2)
icu_benchmarks/data/loader.py
icu_benchmarks/models/train.py
Addresses #184 and #185 to improve training on GPU cluster.
Also fixes everything to `float32`, both for memory efficiency and for compatibility with MPS.
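For context on the MPS constraint, a minimal check (assumes a machine where the MPS backend is available; the exact error raised for double tensors may vary across PyTorch versions):

```python
import torch

x = torch.ones(4, dtype=torch.float64)
if torch.backends.mps.is_available():
    try:
        x.to("mps")  # MPS has no float64 support, so this is expected to fail
    except TypeError as err:
        print(err)
    y = x.to(torch.float32).to("mps")  # casting to float32 first works
    print(y.dtype)  # torch.float32
```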